Machine learning-based clinical decision support system for treatment recommendation and overall survival prediction of hepatocellular carcinoma: a multi-center study

The treatment decisions for patients with hepatocellular carcinoma are determined by a wide range of factors, and there is a significant difference between the recommendations of widely used staging systems and the actual initial treatment choices. Herein, we propose a machine learning-based clinical decision support system suitable for use in multi-center settings. We collected data from nine institutions in South Korea for training and validation datasets. The internal and external datasets included 935 and 1750 patients, respectively. We developed a two-stage model with 20 clinical variables: the first stage recommends initial treatment using an ensemble voting machine, and the second stage predicts post-treatment survival using a random survival forest algorithm. We derived the first and second treatment options from the classes with the highest and second-highest probabilities given by the ensemble model and predicted their post-treatment survival. When only the first treatment option was accepted, the mean accuracy of treatment recommendation in the internal and external datasets was 67.27% and 55.34%, respectively. The accuracy increased to 87.27% and 86.06%, respectively, when the second option was also counted as a correct answer. For survival prediction in the internal and external datasets, Harrell’s C index was 0.8381 and 0.7767, the integrated time-dependent AUC was 91.89 and 86.48, and the integrated Brier score was 0.12 and 0.14, respectively. The proposed system can assist physicians in making treatment decisions by providing data-driven predictions as a reference drawn from other, larger institutions or from other physicians within the same institution.


Feature selection
To enhance the performance of the model and improve its practical applicability in clinical settings, we conducted a feature reduction process, reducing the initially collected 61 pretreatment variables (Supplementary Table 1) to a final selection of 20 variables. In our previous study, we trained cascaded random forest classifiers to recommend each treatment and random survival forest models to predict survival rates after each treatment. During this process, the input variables were sorted based on the feature importance obtained from each model (Supplementary Figures 1-2).
Subsequently, two hepatologists (K.M.K with 22 years of experience and G.H.C with 9 years of experience in treating HCC patients) meticulously reviewed the sorted features and removed redundant variables, resulting in a final selection of 20 variables. Certain variables were excluded for specific reasons despite ranking high in importance. For example, "Presence of HV/IVC" and "bile duct invasion" were omitted because of the limited number of patients, and the "Milan criteria" were excluded as redundant because they overlap with the maximal tumor diameter and tumor number variables.

ML algorithms and hyperparameters
Logistic regression: LogisticRegression(random_state=999), RidgeClassifier(random_state=999), SGDClassifier(random_state=999)

After evaluating all classifiers using stratified five-fold cross-validation, the classifiers were sorted based on mean accuracy. The top-performing three, five, and seven classifiers were then selected to train the ensemble voting machine. The ensemble voting classifier is an algorithm that combines the predictions of multiple individual classifiers to make the final prediction.
The individual classifiers are trained independently, using different learning algorithms on the same dataset. Ensemble voting is primarily used for classification problems and is categorized as hard voting or soft voting according to how the predictions of the individual classifiers are combined into the final prediction. We configured the voting mechanism to use soft voting, in which the predicted class label is the argmax of the sum of the predicted probabilities generated by each classifier.
The ensemble voting machines trained with the top-performing three, five, and seven classifiers were compared with the top-performing individual classifier. As the number of classifiers in the ensemble voting machine increased, a slight improvement in performance was observed. However, there were no significant differences in performance according to the number of classifiers (Supplementary Table 3). To avoid an excessive increase in complexity, we ultimately applied an ensemble voting machine comprising the top five performing classifiers.
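
To make the soft-voting rule concrete, the following minimal Python sketch sums the per-class probabilities from three hypothetical classifiers for a single patient and reads off the first and second treatment options as the classes with the highest and second-highest summed probability. The treatment labels, probability values, and function names are illustrative assumptions and are not taken from the study; this is a hand-written illustration of the rule, not the library implementation used to build the models.

```python
from collections import Counter
from typing import List

# Hypothetical treatment labels, for illustration only.
TREATMENTS = ["Resection", "RFA", "TACE"]


def soft_vote(prob_rows: List[List[float]]) -> List[float]:
    """Sum the class probabilities predicted by each classifier."""
    n_classes = len(prob_rows[0])
    return [sum(row[c] for row in prob_rows) for c in range(n_classes)]


def rank_classes(scores: List[float]) -> List[int]:
    """Return class indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)


def hard_vote(prob_rows: List[List[float]]) -> int:
    """Majority vote over each classifier's single most probable class."""
    votes = [rank_classes(row)[0] for row in prob_rows]
    return Counter(votes).most_common(1)[0][0]


# Invented probabilities for one patient from three classifiers.
patient_probs = [
    [0.70, 0.20, 0.10],
    [0.30, 0.30, 0.40],
    [0.35, 0.20, 0.45],
]

summed = soft_vote(patient_probs)
ranking = rank_classes(summed)
first_option = TREATMENTS[ranking[0]]   # argmax of the summed probabilities
second_option = TREATMENTS[ranking[1]]  # second-highest summed probability
hard_option = TREATMENTS[hard_vote(patient_probs)]

# Hypothetical treatment actually given, used to score the recommendation.
actual = "TACE"
top1_correct = first_option == actual
top2_correct = actual in (first_option, second_option)
```

In this invented case the two voting rules disagree: two of the three classifiers rank TACE first, so a hard vote returns TACE, whereas the summed probabilities favor Resection, so soft voting returns Resection as the first option and TACE as the second. A patient who actually received TACE would count as an error when only the first option is accepted and as correct when the second option is also accepted.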

Model for survival prediction
We trained the random survival forest (RSF) algorithm from sksurv version 0.15.0 to predict individual post-treatment survival. The random survival forest algorithm is an extension of the traditional random forest algorithm that handles survival data. The RSF model is used to model the survival function, which gives the probability that the event occurs after a given time, that is, that the patient survives beyond it. As with the ensemble voting machine, we used the default hyperparameters as indicated below, the only exception being the random state, which was fixed for reproducibility of the results.

Calibration of model
Model calibration is the process of adjusting the predicted probabilities of a classifier to achieve more consistent and reliable probability estimation. By performing calibration across various subsets or regions of the data, model calibration reduces systematic biases and enhances the robustness of predictions. We calibrated the ensemble voting classifier using the CalibratedClassifierCV class from the sklearn.calibration package, version 0.23.2.
CalibratedClassifierCV combines cross-validation and calibration to adjust the predicted probabilities of the model. The calibration method was 'sigmoid', which corresponds to a logistic regression model. The calibration process was conducted using 3-fold cross-validation.

... as Resection. Additionally, we conducted the same analysis for the TACE and Resection groups in the internal dataset used for model training (Supplementary Figure 6A). Furthermore, the same analysis was performed for the matched TACE group (Supplementary Figure 6B).

Supplementary
We evaluated 19 different machine learning algorithms for the treatment classification task: Logistic regression, Decision tree, Extra-trees, Random forest, Adaboost, Gradient boosting machine (gbm), Histogram-based gradient boosting, Xgboost, light gbm, CatBoost, Gaussian naive Bayes, Naive Bayes for multivariate Bernoulli models, Gaussian process classification, Linear discriminant analysis, Quadratic discriminant analysis, C-support vector machine, Multi-layer perceptron, K-nearest neighbors classifier, and K-means clustering. Scikit-Learn version 0.23.2, Xgboost version 1.5.0, catboost version 1.0.1, and lightgbm version 3.2.1 were used to construct these models. All variables were normalized using Scikit-Learn's MinMaxScaler before training the machine learning algorithms. All classifiers underwent limited hyperparameter tuning during training.
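
The min-max normalization applied to all variables before classifier training rescales each variable as (x - min) / (max - min), so that the fitted data span the interval [0, 1]. The sketch below is a plain-Python illustration of that formula rather than the Scikit-Learn implementation used in the study. The variable values are invented, the function names are hypothetical, and fitting the range on one dataset and reusing it on another is an assumption about the workflow that the text does not state.

```python
from typing import List, Tuple


def fit_min_max(column: List[float]) -> Tuple[float, float]:
    """Learn the minimum and maximum of one variable from the fitted data."""
    return min(column), max(column)


def apply_min_max(column: List[float], lo: float, hi: float) -> List[float]:
    """Rescale values with a previously fitted range: (x - lo) / (hi - lo)."""
    span = hi - lo
    if span == 0:
        # A constant variable carries no information; map it to zero.
        return [0.0 for _ in column]
    return [(x - lo) / span for x in column]


# Invented values of one variable (for example, a tumor diameter in cm).
internal = [2.0, 3.5, 5.0, 8.0]   # data used to fit the range
external = [1.0, 4.0, 11.0]       # data scaled with the fitted range

lo, hi = fit_min_max(internal)
internal_scaled = apply_min_max(internal, lo, hi)
external_scaled = apply_min_max(external, lo, hi)
```

Under this assumption, values in the second dataset that fall outside the fitted range map outside [0, 1] (here, below 0 and above 1), which is one way in which differences between institutions can surface in the scaled inputs.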

Table 2.
2) Institutional criteria of Asan Medical Center for liver transplantation. Baseline characteristics of the patients in the external validation datasets.
* median (IQR) in parentheses. KUGH Korea University Guro Hospital. SNUBH Seoul National University Bundang Hospital. SMC Samsung Medical Center. SNUH Seoul National University Hospital. CMC Catholic Medical Center. SH Severance Hospital. CUH Chung-ang University Hospital. IUH Inha University Hospital. AFP alpha-fetoprotein. ALT alanine aminotransferase. EBRT external beam radiotherapy. ECOG Eastern Cooperative Oncology Group. INR international normalized ratio. PEIT percutaneous ethanol injection. RFA radiofrequency ablation. TACE transarterial chemoembolization. † RFA feasibility was defined as a size or location of the tumor to receive percutaneous RFA successfully without significant complications.

Table 3.
Performance according to the number of classifiers composing the voting classifier.

Table 4.
Performance of ensemble voting classifier vs. cascaded random forest model.

Table 5.
Performance of individual training vs. external validation. KUGH Korea University Guro Hospital. SNUBH Seoul National University Bundang Hospital. SMC Samsung Medical Center. SNUH Seoul National University Hospital. CMC Catholic Medical Center. SH Severance Hospital. CUH Chung-ang University Hospital. IUH Inha University Hospital.

Table 6.
Performance of non-calibrated model vs. calibrated model.

Through t-SNE visualization, we observed that the features of the TACE group in the external dataset, which were misclassified as Resection, were closely clustered with the features from the Resection group in the internal dataset used for model training. This observation was more pronounced in the matched TACE group. Our feature analysis suggests that patients with similar features, but belonging to different institutions, may receive different treatments.

Table 7.
Baseline characteristics of the patients in the external validation datasets. Data are n (%), mean or * median (range) in parentheses. p values were calculated using the χ² test, Student's t-test, or Mann-Whitney U test to compare the Resection group with the TACE group and the Matched resection group with the Matched TACE group. AFP alpha-fetoprotein. ALT alanine aminotransferase. EBRT external beam radiotherapy. ECOG Eastern Cooperative Oncology Group. INR international normalized ratio. † RFA feasibility was defined as a size or location of the tumor to receive percutaneous RFA successfully without significant complications.